Abstract

We present a unified mathematical framework for analyz- 
ing the tradeoffs between parallelism and storage allocation 
within a parallelizing compiler. Using this framework, we 
show how to find the best storage mapping for a given sched- 
ule, the best schedule for a given storage mapping, and the 
best storage mapping that is valid for all legal schedules. 
Our technique combines affine scheduling techniques with 
occupancy vector analysis, and incorporates general affine 
dependencies across statements and loop nests. We formu- 
late the constraints imposed by the data dependencies and 
the storage mapping as a set of linear inequalities, and ap- 
ply numerical programming techniques to efficiently solve 
for the best occupancy vector. We consider our method to 
be a first step towards automating a procedure that finds the 
optimal tradeoff between parallelism and storage space. 

1 Introduction 

It remains an important and relevant problem in computer 
science to automatically find an efficient mapping of a se- 
quential program onto a parallel architecture. Though there 
are many heuristic algorithms in practical systems and par- 
tial or suboptimal solutions in the literature, a theoreti- 
cal framework that can fully describe the entire problem 
and find the optimal solution is still lacking. The difficulty 
stems from the fact that multiple inter-related costs and 
constraints must be considered simultaneously to obtain an 
efficient executable. 

While exploiting the parallelism of a program is an im- 
portant step towards achieving efficiency, gains in paral- 
lelism are often overwhelmed by other costs relating to data 
locality, synchronization, and communication. In particu- 
lar, with the widening gap between clock speed and mem- 
ory latency, and with modern memory systems becoming in- 
creasingly hierarchical, the amount of storage space required 
by a program can have a drastic effect on its performance. 
Nonetheless, parallelizing compilers often employ varying 
degrees of array expansion [9, 5, 1] to eliminate element- 
level anti and output dependencies, thereby adding large 
amounts of storage that may or may not be justified by the 
resulting gains in parallelism. 

Thus, compilers must be able to analyze the tradeoffs
between parallelism and storage requirements in order to 
arrive at an efficient executable. In this paper, we intro- 
duce a unifying mathematical framework that incorporates 
both schedule constraints (restricting when statements can 
be executed) and storage constraints (restricting where their 
results can be stored). The framework is capable of han- 
dling any loop-based program with affine array references 
and loop bounds, making it applicable to many scientific 
applications. 

Using this framework, we also present solutions to three 
important scheduling problems. Namely, we show how to 
determine 1) the best storage mapping for a given schedule, 
2) the best schedule for a given storage mapping, and 3) the 
best storage mapping that is valid for all legal schedules. 
Our solutions are optimal (in a sense that is defined below), 
and our method is practical in that it reduces to a linear 
program that can be efficiently solved with standard tech- 
niques. We believe that these solutions represent the first 
step towards automating a procedure that finds the optimal 
compromise between parallelism and storage space. 

2 Abstract Problem 

To motivate our approach, we consider simplified descrip- 
tions of the scheduling problems faced by a parallelizing 
compiler. We are given a directed acyclic graph $G = (V, E)$.
Each vertex $v \in V$ represents a dynamic instance of an
instruction; a value will be produced as a result of executing $v$.
Each edge $(v_1, v_2) \in E$ represents a dependence of $v_2$ on the
value produced by $v_1$. Thus, each edge $(v_1, v_2)$ imposes the
schedule constraint that $v_1$ be executed before $v_2$, and the
storage constraint that the value produced by $v_1$ be stored
until the execution time of $v_2$.

Our task is to output $(\Theta, m)$, where $\Theta$ is a function mapping each operation $v \in V$ to its execution time, and $m$ is the maximum number of values that we need to store at a given time. Parallelism is expressed implicitly by assigning the same execution time to multiple operations. To simplify the problem, we ignore the question of how the values are mapped to storage cells and assume that live values are stored in a fully associative map of size $m$. How, then, might we go about choosing $\Theta$ and $m$?

2.1 Choosing a Store Given a Schedule 

The first problem is to find the optimal storage mapping for a given schedule. That is, we are given $\Theta$ and choose $m$ such that 1) $(\Theta, m)$ respects the storage constraints, and 2) $m$ is as small as possible.

    A[][] = new int[n][m]

    for j = 1 to m
      for i = 1 to n
        A[i][j] = f(A[i-2][j-1],
                    A[i][j-1],
                    A[i+1][j-1])        (S1)

Figure 1: Original code for Example 1.

This problem is orthogonal to the traditional loop paral- 
lelization problem. After selecting the instruction schedule 
by any of the existing techniques, we are interested in iden- 
tifying the best storage allocation. That is, with schedule- 
specific storage optimization we can build upon the perfor- 
mance gains of any one of the many scheduling techniques 
available to the parallelizing compiler. 
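
To make the abstract problem concrete, the following Python sketch computes the smallest $m$ for a given schedule on a tiny DAG. It assumes one liveness convention that the text does not spell out: a value occupies a cell from the time its producer runs through the time of its last consumer, inclusive, and a value with no consumer is live only at its own time step. The function name `min_store_size` and the example graph are hypothetical.

```python
from collections import defaultdict

def min_store_size(edges, schedule):
    """Return the peak number of simultaneously live values.

    edges    -- iterable of (producer, consumer) pairs of the DAG
    schedule -- dict mapping each vertex to its integer execution time
    Assumption: a value is live on the closed interval
    [time(producer), time(last consumer)].
    """
    # Schedule constraint: every producer runs strictly before its consumer.
    for v1, v2 in edges:
        if schedule[v1] >= schedule[v2]:
            raise ValueError(f"schedule violates dependence {v1} -> {v2}")

    # Last time step at which each value is still needed.
    last_use = dict(schedule)  # default: live only at its own time step
    for v1, v2 in edges:
        last_use[v1] = max(last_use[v1], schedule[v2])

    # Count the live values at every time step and take the maximum.
    live = defaultdict(int)
    for v, start in schedule.items():
        for t in range(start, last_use[v] + 1):
            live[t] += 1
    return max(live.values()) if live else 0

# Two independent chains: a feeds b, and c feeds d.
edges = [("a", "b"), ("c", "d")]
parallel = {"a": 0, "c": 0, "b": 1, "d": 1}    # both chains run side by side
sequential = {"a": 0, "b": 1, "c": 2, "d": 3}  # one operation per time step
print(min_store_size(edges, parallel), min_store_size(edges, sequential))
```

Under this convention the parallel schedule finishes in two steps but needs a larger store than the four-step sequential schedule, which is the tradeoff the framework is meant to analyze.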

2.2 Choosing a Schedule Given a Store 

The second problem is to find the optimal schedule for a given size of the store, if any valid schedule exists. That is, we are given $m$ and choose $\Theta$ such that 1) $(\Theta, m)$ respects the schedule and storage constraints, and 2) $\Theta$ assigns the earliest possible execution time to each instruction. Note that if $m$ is too small, there might not exist a $\Theta$ that respects the constraints.

This problem is highly relevant in practice because of the stepwise, non-linear effect of storage size on execution time. For example, when the required storage no longer fits in the register file or the cache and must spill to the cache or the external DRAM, respectively, the cost of storage increases dramatically. Further, since there are only a few discrete storage spaces in the memory hierarchy, and their sizes are known for a given architecture, the compiler can adopt the strategy of trying to restrict the store to successively smaller spaces until no valid schedule exists. Once the storage is at the lowest possible level, the schedule can then be shortened, which has a more continuous and linear effect on efficiency than the storage optimization. The result is a near-optimal storage allocation and instruction schedule.

2.3 Choosing a Store for all Schedules 

The final problem is to find the optimal storage mapping that is valid for all legal schedules. That is, we are given a (possibly infinite) set $\Psi = \{\Theta_1, \Theta_2, \ldots\}$, where each $\Theta$ in $\Psi$ respects the schedule constraints. We choose $m$ such that 1) $\forall \Theta \in \Psi$, $(\Theta, m)$ respects the storage constraints, and 2) $m$ is as small as possible.

A solution to this problem allows us to have the mini- 
mum storage requirements without sacrificing any flexibility 
of our scheduling. For instance, we could first apply our stor- 
age mapping, and then arrange the schedule to optimize for 
data locality, synchronization, or communication, without 
worrying about violating the storage constraints. 

Such flexibility could be critical if, for example, we want 
to apply loop tiling [10] in conjunction with storage opti- 
mization. If we optimize storage too much, tiling could 
become illegal; however, we sacrifice efficiency if we don't 
optimize storage at all. Thus, we optimize storage as much
as we can without invalidating a schedule that was valid
under the original storage mapping.

Figure 2: Iteration space diagram for Example 1. Given the schedule where each row is executed in parallel, our method identifies (0, 2) as the shortest valid occupancy vector.

    A[] = new int[2*n+1]

    for j = 1 to m
      for i = 1 to n
        A[2*i + (j mod 2)] = f(A[2*(i-2) + ((j-1) mod 2)],
                               A[2*i + ((j-1) mod 2)],
                               A[2*(i+1) + ((j-1) mod 2)])        (S1)

Figure 3: Transformed code for Example 1. The occupancy vector is (0,2).

More generally, if our analysis indicates that certain sched- 
ules are undesirable by any measure, we could add edges to 
the dependence graph and solve again for the smallest m 
sufficient for all the remaining candidate schedules. In this 
way, m provides the best storage option that is legal across 
the entire set of schedules under consideration. 

3 Concrete Problem 

Unfortunately, the domain of real programs does not lend 
itself to the simple DAG representation as presented above. 
Primarily, loop bounds in programs are often specified by 
symbolic expressions instead of constants, thereby yielding a 
parameterized and infinite dependence graph. Furthermore, 
even when the constants are known, the problem sizes are 
too large for schedule and storage analysis on a DAG, and 
the executable grows to an infeasible size if a static instruc- 
tion is generated for every node in the DAG. 

Accordingly, we make two sets of simplifying assump- 
tions to make our analysis tractable. The first concerns 
the nature of the dependence graph G and the scheduling 
function $\Theta$. Instead of allowing arbitrary edge relationships
and execution orderings, we restrict our attention to affine 
dependencies and affine schedules. The second assumption 
concerns our approach to the optimized storage mapping. 
Instead of allowing a fully associative map of size m, as 
above, we employ the occupancy vector as a mechanism of 
storage reuse. In the following sections, we discuss these 
assumptions in the context of an example. 

3.1 Program Domain 

Primarily, we require an affine description of the dependencies of the program. This formulation gives an accurate description of the dependencies of programs with static control flow and affine index expressions [6] and can be estimated conservatively for others. As will become clear below, restricting our attention to affine dependencies allows us to model the infinite dependence graph as a finite set of parameters, which is central to the method.

Figure 4: Iteration space diagram for Example 1. Given an occupancy vector of (0, 2), our method identifies the range of valid schedules. An affine schedule will sweep across the space, executing a line of iterations at once. If this line falls within the gray region (as on the left), then the schedule is valid for the occupancy vector. If this line falls within the striped region (as on the right), then the schedule is valid for some other occupancy vector. The schedule at right is invalid because of the parallel read/write occurring from the tip of the occupancy vector.

In this paper, we further assume that the iteration space 
of each statement exactly corresponds with the data space 
of the array written by that statement. That is, for array 
references appearing on the left hand side of an assignment, 
the expression indexing the i'th dimension of the array is the 
index variable of the i'th enclosing loop (this is formalized 
below). While techniques such as array expansion [5] can 
be used to convert programs with affine dependencies into 
this form, our analysis will be most useful in cases where 
an expanded form was obtained for other reasons (e.g., par- 
allelism detection) and one now seeks to reduce storage re- 
quirements. 

We will refer to the example in Figure 1, borrowed from 
[18]. It clearly falls within our input domain, as the dependencies
have constant distance, and iteration (i,j) assigns to
A[i][j]. This example could represent a computation where
a one-dimensional array A[i] is being updated over a time 
dimension j, and the intermediate results are being stored. 
We assume that only the element A[n][m] is used outside the 
loop; the other values are only temporary. 

3.2 Occupancy Vectors 

We arrive at a simple model of storage reuse via the oc- 
cupancy vector [18]. Informally, an occupancy vector for a 
given array defines equivalence classes over the locations of 
the array. Two locations of an array are stored in the same 
location following a storage transformation if and only if 
they are separated by an integral multiple of the occupancy 
vector: 

Definition 1 $A'$ is the result of transforming $A$ under the occupancy vector $\vec{v}$ if for all pairs of locations $(\vec{l}_1, \vec{l}_2)$ of $A$: $\vec{l}_1 = \vec{l}_2 + k * \vec{v}$ for some integer $k$ if and only if $\vec{l}_1$ and $\vec{l}_2$ are stored in the same location in $A'$.

We say that an occupancy vector $\vec{v}$ is valid for an array $A$ with respect to a given schedule $\Theta$ if transforming $A$ under $\vec{v}$ everywhere in the program does not change the semantics when the program is executed according to $\Theta$.

Given an occupancy vector, we implement the storage 
transformation using the technique of [18] in which the orig- 
inal data space is projected onto the hyperplane that is per- 
pendicular to the occupancy vector. If an occupancy vector 
intersects multiple (integral) points of the data space, then 
modulation must be used to distinguish these points in the 
transformed array. 
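
The projection and modulation can be sketched for two-dimensional arrays as follows. This is an illustrative construction, not the implementation of [18]: the helper names `ext_gcd` and `storage_key` are hypothetical, and the key it returns is only one of many mappings that satisfy Definition 1. The first component is a projection that is constant along the occupancy vector; the second is the modulation that separates the integral points the vector passes through.

```python
from math import gcd

def ext_gcd(a, b):
    """Extended Euclid for a, b >= 0: return (g, p, q) with p*a + q*b == g."""
    if b == 0:
        return a, 1, 0
    g, p, q = ext_gcd(b, a % b)
    return g, q, p - (a // b) * q

def storage_key(loc, ov):
    """Map a 2-D location to a key shared exactly by the locations that
    differ from it by an integer multiple of the occupancy vector ov."""
    (x, y), (v1, v2) = loc, ov
    g = gcd(v1, v2)                # ov spans g primitive lattice steps
    if g == 0:
        raise ValueError("occupancy vector must be nonzero")
    u1, u2 = v1 // g, v2 // g      # primitive direction of ov
    # Projection: unchanged when moving along the direction of ov.
    proj = x * u2 - y * u1
    # Position along that direction, counted in primitive steps
    # (p*u1 + q*u2 == 1, so one step along (u1, u2) adds exactly 1).
    _, p, q = ext_gcd(abs(u1), abs(u2))
    p = p if u1 >= 0 else -p
    q = q if u2 >= 0 else -q
    step = p * x + q * y
    # Modulation: points whose step counts differ by a multiple of g coincide.
    return proj, step % g

print(storage_key((3, 5), (0, 2)), storage_key((3, 7), (0, 2)))
print(storage_key((1, 1), (1, 2)), storage_key((2, 3), (1, 2)))
```

For the vector (0, 2) the key reduces to (i, j mod 2), and for (1, 2) the projection is 2i - j with no modulation needed.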

Occupancy vector transformations are useful for reduc- 
ing storage requirements when many of the values stored 
in the array are temporary. Generally, shorter occupancy 
vectors lead to smaller storage requirements because more 
elements of the original array are coalesced into the same 
storage location. However, the shape of the array also has 
the potential to influence the transformed storage require- 
ments. Throughout this paper, we assume that the shapes 
of arrays have second-order effects on storage requirements, 
and refer to the "best" occupancy vector as that which is 
the shortest. 

We are now in a position to consider our occupancy vec- 
tor analysis as applied to Example 1. First, assume that we 
have chosen to execute each row in parallel so as to have 
the shortest schedule. What is the best storage mapping for 
this schedule? Our method can identify (0, 2) as the shortest 
occupancy vector for this schedule (see Figure 2), yielding 
the code in Figure 3. 
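
The claim for the row-parallel schedule can be checked by exhaustive search on this small example. The sketch below is not the linear-programming method of this paper; it assumes a schedule of the form theta(i, j) = a*i + b*j (terms in the loop bounds cancel in every difference), ignores the boundaries of the iteration domain, and searches a small box of candidate vectors. The names `DEPS`, `is_valid_ov`, and `shortest_ov` are hypothetical.

```python
from itertools import product

# Uniform dependence distances of Example 1: iteration (i, j) reads the value
# produced by iteration (i + di, j + dj).
DEPS = [(-2, -1), (0, -1), (1, -1)]

def is_valid_ov(ov, sched):
    """Storage constraint for the linear schedule theta(i, j) = a*i + b*j.

    The producer (i, j) + d is overwritten by (i, j) + d + ov, which must run
    at least one step after the consumer (i, j). For uniform dependences the
    difference theta(d + ov) does not depend on (i, j).
    """
    a, b = sched
    return all(a * (di + ov[0]) + b * (dj + ov[1]) >= 1 for di, dj in DEPS)

def shortest_ov(sched, radius=4):
    """Search a small box for a valid vector of least Manhattan length."""
    candidates = [v for v in product(range(-radius, radius + 1), repeat=2)
                  if v != (0, 0)]
    candidates.sort(key=lambda v: (abs(v[0]) + abs(v[1]), v))
    for v in candidates:
        if is_valid_ov(v, sched):
            return v
    return None

row_parallel = (0, 1)   # theta(i, j) = j: every row executes in one step
print(shortest_ov(row_parallel))
```

With theta(i, j) = j, every constraint reduces to requiring that the second component of the vector be at least 2, which is why no vector shorter than (0, 2) survives the search.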

Secondly, consider the case where we become interested 
in adding some flexibility to our scheduling. What sched- 
ules can we consider without changing the storage mapping 
induced by the occupancy vector of (0, 2) above? As illus- 
trated in Figure 4, our method can identify all legal affine 
schedules for the occupancy vector of (0, 2). We could then 
use affine scheduling techniques [7] to choose amongst these 
schedules according to other criteria. 

3.3 Affine Occupancy Vectors 

Finally, we might inquire as to the shortest occupancy vector 
that is valid for all affine schedules in Example 1. An affine 
schedule is one where each dynamic instance of a statement 
is executed at a time that is an affine expression of the loop 
indices, loop bounds, and compile-time constants. To ad- 
dress the problem, then, we need the notion of an Affine 
Occupancy Vector: 

Definition 2 An occupancy vector $\vec{v}$ for array $A$ is an Affine Occupancy Vector (AOV) if it is valid with respect to every affine schedule $\Theta$ that respects the schedule constraints of the original program.

Figure 5: Iteration space diagram for Example 1. Here the hollow arrow denotes an Affine Occupancy Vector that is valid for all legal affine schedules. The gray region indicates the slopes at which a legal affine schedule can sweep across the iteration domain.

    A[] = new int[2*n+m]

    for j = 1 to m
      for i = 1 to n
        A[2*i-j+m] = f(A[2*(i-2)-(j-1)+m],
                       A[2*i-(j-1)+m],
                       A[2*(i+1)-(j-1)+m])        (S1)

Figure 6: Transformed code for Example 1. The AOV is (1,2).

Note that, in contrast to the Universal Occupancy Vector 
of [18], an AOV need not be valid for all schedules; rather, 
it only must be valid for affine ones. Almost all the sched- 
ules found in practice are affine, since any FOR loop with 
constant increment defines a schedule that is affine in its 
loop indices. In this paper, we further relax the definition 
of an AOV to those occupancy vectors which are valid for 
all one-dimensional affine schedules. We have yet to extend 
our method to higher dimensional schedules. 

We also observe that, if tiling is legal in the original 
program, then tiling is legal after transforming each array 
in the program under one of its AOV's. This follows from 
the fact that two loops are tilable if and only if they can be 
permuted without affecting the semantics of the program 
[10]. Since each permutation of the loops corresponds to a 
given affine schedule and the AOV is valid with respect to 
both schedules, the AOV transformation is also valid with 
respect to a tiled schedule. 

Returning to our example, we find using our method 
that (1,2) is a valid AOV (see Figure 5). Any affine one- 
dimensional schedule that respects the dependencies in the 
original code will give the same result when executed with 
the transformed storage. 
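
As a sanity check rather than a proof, the sketch below samples integer schedules theta(i, j) = a*i + b*j from a small box, keeps those that respect the dependencies of Example 1, and looks for one that a candidate occupancy vector would break. The box size, the restriction to integer coefficients, and the function names are assumptions made for this illustration; terms in the loop bounds are omitted because they cancel in every difference.

```python
from itertools import product

DEPS = [(-2, -1), (0, -1), (1, -1)]   # read offsets of Example 1

def is_legal(sched):
    """Schedule constraint: each producer (i, j) + d runs before (i, j)."""
    a, b = sched
    return all(-(a * di + b * dj) >= 1 for di, dj in DEPS)

def respects_storage(ov, sched):
    """Storage constraint: the overwrite at (i, j) + d + ov runs after (i, j)."""
    a, b = sched
    return all(a * (di + ov[0]) + b * (dj + ov[1]) >= 1 for di, dj in DEPS)

def counterexample(ov, bound=6):
    """Return a sampled legal schedule that ov breaks, or None if there is none."""
    for sched in product(range(-bound, bound + 1), repeat=2):
        if is_legal(sched) and not respects_storage(ov, sched):
            return sched
    return None

print(counterexample((1, 2)))   # no sampled legal schedule is broken
print(counterexample((0, 2)))   # a legal schedule that (0, 2) cannot support
```

The vector (0, 2) fails for the legal schedule theta(i, j) = i + 2j, since the overwrite of A[i-2][j-1] would then run at the same time as its reader; (1, 2) survives every sampled schedule.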

4 The Method 

4.1 Notation 

We adopt the following notation: 



• An iteration vector $\vec{i}$ contains the values of surrounding loop indices at a given point in the execution of the program.

• The structural parameters $\vec{n}$, of domain $\mathcal{N}$, represent loop bounds and other parameters that are unknown at compile time, but that are fixed for any given execution of the program.

• There are $m_s$ statements $S_1 \ldots S_{m_s}$ in the program. Each statement $S$ has an associated polyhedral domain $\mathcal{D}_S$, such that $\forall \vec{i} \in \mathcal{D}_S$, there is a dynamic instance $S(\vec{i})$ of statement $S$ at iteration $\vec{i}$ during the execution of the program.

• With each statement $S$ is associated a scheduling function $\Theta_S$ which maps the instance of $S$ on iteration $\vec{i}$ to a scalar execution time. By assumption, $\Theta_S$ is an affine function of the iteration vector and the structural parameters: $\Theta_S(\vec{i}, \vec{n}) = \vec{a}_S \cdot \vec{i} + \vec{b}_S \cdot \vec{n} + c_S$.

• There are $m_p$ dependencies $P_1 \ldots P_{m_p}$ in the program. Each dependence $P_j$ is a 4-tuple $(R_j, T_j, h_j, \mathcal{P}_j)$ where $R_j$ and $T_j$ are statements, $h_j$ is a vector-valued affine function, and $\mathcal{P}_j \subseteq \mathcal{D}_{R_j}$ is a polyhedron such that:

  $$\forall \vec{i} \in \mathcal{P}_j,\ R_j(\vec{i}) \text{ depends on } T_j(h_j(\vec{i}, \vec{n})) \qquad (1)$$

  The dependencies $P_j$ are determined using an array dataflow analysis, e.g., the Omega test [16].

• There are $m_a$ arrays $A_1 \ldots A_{m_a}$ in the program, and $A(S)$ denotes the array assigned to by statement $S$. Our assumption that the data space corresponds with the iteration space implies that for all statements $S$, $S(\vec{i})$ writes to location $\vec{i}$ of $A(S)$. However, several statements can write to the same array.

• With each array $A$ we will associate an occupancy vector $\vec{v}_A$ that specifies the storage reuse within $A$. The locations $\vec{l}_1$ and $\vec{l}_2$ in the original data space of $A$ will be stored in the same location following our storage transform if and only if $\vec{l}_1 = \vec{l}_2 + k * \vec{v}_A$, for some integer $k$. Given our assumption about the data space, we can equivalently state that the values produced by iterations $\vec{i}_1$ and $\vec{i}_2$ will be stored in the same location following our storage transform if and only if $\vec{i}_1 = \vec{i}_2 + k * \vec{v}_A$, for some integer $k$.

4.2 Schedule Constraints 

According to dependence $P_j$ (equation (1)), for any value of $\vec{i}$ in $\mathcal{P}_j$, operation $R_j(\vec{i})$ depends on the execution of operation $T_j(h_j(\vec{i}, \vec{n}))$. Therefore, in order to preserve the semantics of the original program, in any new order of the computations, $T_j(h_j(\vec{i}, \vec{n}))$ must be scheduled at a time strictly earlier than $R_j(\vec{i})$, for all $\vec{i} \in \mathcal{P}_j$. We express this constraint in terms of the scheduling function. We must have, for each dependence $P_j$, $j \in [1, m_p]$:

$$\forall \vec{n} \in \mathcal{N},\ \forall \vec{i} \in \mathcal{P}_j,\quad \Theta_{R_j}(\vec{i}, \vec{n}) - \Theta_{T_j}(h_j(\vec{i}, \vec{n}), \vec{n}) - 1 \ge 0 \qquad (2)$$

These dependence constraints can be solved using Farkas' lemma as shown by Feautrier [7, 8, 4]. The result can be expressed as a polyhedron $\mathcal{R}$: the set of all the legal schedules $\Theta = (\vec{a}_{S_1}, \vec{b}_{S_1}, c_{S_1}, \ldots, \vec{a}_{S_{m_s}}, \vec{b}_{S_{m_s}}, c_{S_{m_s}})$ in the space of scheduling parameters. Note that equation (2) does not always have a solution [7]. In such a case, one needs to use multidimensional schedules [8]. However, in this paper, we assume that equation (2) has a solution.

4.3 Storage Constraints 

The occupancy vectors induce some storage constraints. We consider any array $A$. Because we assume that the data space corresponds with the iteration space, and by definition of the occupancy vectors, the values computed by iterations $\vec{i}$ and $\vec{i} + \vec{v}_A$ are both stored in the same location $l$. For an occupancy vector $\vec{v}_A$ to be valid for a given data object $A$, every operation depending on the value stored at location $l$ by iteration $\vec{i}$ must execute before iteration $\vec{i} + \vec{v}_A$ stores a new value at location $l$. Otherwise, following our storage transformation, a consumer expecting to reference the contents of $l$ produced by iteration $\vec{i}$ could reference the contents of $l$ written by iteration $\vec{i} + \vec{v}_A$ instead, thereby changing the semantics of the program.

Let us consider a dependence $P = (R, T, h, \mathcal{P})$. Then operation $T(h(\vec{i}, \vec{n}))$ produces a value which will later be read by $R(\vec{i})$. This value will be overwritten by $T(h(\vec{i}, \vec{n}) + \vec{v}_{A(T)})$. The storage constraint imposes that $T(h(\vec{i}, \vec{n}) + \vec{v}_{A(T)})$ is scheduled after $R(\vec{i})$. Therefore, any schedule $\Theta$ and any occupancy vector $\vec{v}_{A(T)}$ respects the dependence $P$ if:

$$\forall \vec{n} \in \mathcal{N},\ \forall \vec{i} \in \mathcal{Z},\quad \Theta_T(h(\vec{i}, \vec{n}) + \vec{v}_{A(T)}, \vec{n}) - \Theta_R(\vec{i}, \vec{n}) - 1 \ge 0 \qquad (3)$$

where $\mathcal{Z}$ represents the domain over which the storage constraint applies. That is, the storage constraint applies for all iterations $\vec{i}$ where $\vec{i}$ is in the domain of the dependence, and where $h(\vec{i}, \vec{n}) + \vec{v}_{A(T)}$ is in the domain of statement $T$.

Formally, $\mathcal{Z} = \{\vec{i} \mid \vec{i} \in \mathcal{P} \wedge h(\vec{i}, \vec{n}) + \vec{v}_{A(T)} \in \mathcal{D}_T\}$. This definition of $\mathcal{Z}$ is not problematic, since the intersection of two polyhedra is defined simply by the union of the affine inequalities describing each, which is again a polyhedron. Note, however, that $\mathcal{Z}$ is parameterized by both $\vec{v}_{A(T)}$ and $\vec{n}$, and not simply by $\vec{n}$.

Equation (3) expresses the constraint on an occupancy vector for a given dependence and a given schedule. For an occupancy vector to be an AOV, however, it must respect all dependencies across all legal schedules. Thus, the following constraint defines a valid AOV $\vec{v}_A$ for each object $A$ in the program:

$$\forall \Theta \in \mathcal{R},\ \forall \vec{n} \in \mathcal{N},\ \forall j \in [1, m_p],\ \forall \vec{i} \in \mathcal{Z}_j,\quad \Theta_{T_j}(h_j(\vec{i}, \vec{n}) + \vec{v}_{A(T_j)}, \vec{n}) - \Theta_{R_j}(\vec{i}, \vec{n}) - 1 \ge 0 \qquad (4)$$

4.4 Linearizing the Constraints 

Equations (3) and (4) represent a possibly infinite set of 
constraints, because of the parameters. Therefore, we need 
to rewrite them so as to obtain an equivalent but finite set 
of affine (in)equations, which we can easily solve. Meanwhile,
we seek to express the schedule (2) and storage (4)
constraints in forms linear in the scheduling parameters $\Theta$.
This step is essential for constructing a linear program that
minimizes the length of the AOV's.

4.4.1 Reduction using the vertices of polyhedra 

Any nonempty polyhedron is fully defined by its vertices, rays and lines [17], which can be computed even in the case of parameterized polyhedra [14]. The following theorem explains how we can use these vertices, rays and lines to reduce the size of our sets of constraints.

Theorem 1 Let $\mathcal{D}$ be a nonempty polyhedron. $\mathcal{D}$ can be written $\mathcal{D} = P + C$, where $P$ is a polytope (bounded polyhedron) and $C$ is a cone. Then any affine function $h$ defined over $\mathcal{D}$ is nonnegative on $\mathcal{D}$ if and only if 1) $h$ is nonnegative on each of the vertices of $P$ and 2) the linear part of $h$ is nonnegative (resp. null) on the rays (resp. lines) of $C$.

All the polyhedra produced by the dependence analysis of programs are in fact polytopes, or bounded polyhedra (the domain of parameters $\mathcal{N}$ is an input of this analysis and may be unbounded). Therefore, in order to simplify the equations, we now assume that all the polyhedra we manipulate are polytopes, except when stated otherwise. Then, according to Theorem 1, an affine function is nonnegative on a polytope if and only if it is nonnegative on the vertices of this polytope. We successively use this theorem to eliminate the iteration vector and the structural parameters from equation (3).
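
The test in Theorem 1 is simple to state in code. The following sketch assumes the vertices, rays, and lines of the polyhedron have already been computed (for instance by a polyhedral library); the function name `nonnegative_on_polyhedron` and the two example domains are hypothetical and are not taken from the examples in this paper.

```python
def nonnegative_on_polyhedron(coeffs, const, vertices, rays=(), lines=()):
    """Decide whether h(x) = coeffs . x + const is >= 0 on P + C, where P is
    the convex hull of `vertices` and C is generated by `rays` and `lines`."""
    def linear(x):
        # Linear part of h, without the constant term.
        return sum(c * xi for c, xi in zip(coeffs, x))
    return (all(linear(v) + const >= 0 for v in vertices)   # h on vertices
            and all(linear(r) >= 0 for r in rays)           # linear part on rays
            and all(linear(l) == 0 for l in lines))         # null on lines

# A polytope (the unit square): only the vertices matter.
square = [(0, 0), (1, 0), (0, 1), (1, 1)]
print(nonnegative_on_polyhedron((1, -1), 1, square))   # h = x - y + 1
print(nonnegative_on_polyhedron((1, -2), 1, square))   # h = x - 2y + 1

# An unbounded polyhedron (the first quadrant): one vertex and two rays.
origin, axes = [(0, 0)], [(1, 0), (0, 1)]
print(nonnegative_on_polyhedron((2, 1), 0, origin, rays=axes))    # h = 2x + y
print(nonnegative_on_polyhedron((1, -1), 0, origin, rays=axes))   # h = x - y
```

The point of the theorem is that a constraint quantified over infinitely many points is replaced by one check per vertex, ray, and line.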

4.4.2 Eliminating the Iteration Vector 

Let us consider any fixed values of $\Theta$ in $\mathcal{R}$ and $\vec{n}$ in $\mathcal{N}$. Then, for all $j \in [1, m_p]$, $\vec{v}_{A(T_j)}$ must satisfy:

$$\forall \vec{i} \in \mathcal{Z}_j,\quad \Theta_{T_j}(h_j(\vec{i}, \vec{n}) + \vec{v}_{A(T_j)}, \vec{n}) - \Theta_{R_j}(\vec{i}, \vec{n}) - 1 \ge 0 \qquad (5)$$

which is an affine inequation in $\vec{i}$ (as $h_j$, $\Theta_{T_j}$, and $\Theta_{R_j}$ are affine functions). Thus, according to Theorem 1, it takes its extremal values on the vertices of the polytope $\mathcal{Z}_j$, denoted by $\vec{z}_{1,j}, \ldots, \vec{z}_{m_z,j}$. Note that $\mathcal{Z}_j$ is parameterized by $\vec{n}$ and $\vec{v}_{A(T_j)}$. Therefore, the number of its vertices might change depending on the domain of values of $\vec{n}$ and $\vec{v}_{A(T_j)}$. In this case we decompose the domains of $\vec{n}$ and $\vec{v}_{A(T_j)}$ into subdomains over which the number and definition of the vertices do not change [14], we solve our problem on each of these domains, and we take the "best" solution.

Thus, we evaluate (5) at the extreme points of $\mathcal{Z}_j$, yielding the following for each $k \in [1, m_z]$:

$$\Theta_{T_j}(h_j(\vec{z}_{k,j}(\vec{v}_{A(T_j)}, \vec{n}), \vec{n}) + \vec{v}_{A(T_j)}, \vec{n}) - \Theta_{R_j}(\vec{z}_{k,j}(\vec{v}_{A(T_j)}, \vec{n}), \vec{n}) - 1 \ge 0 \qquad (6)$$

According to Theorem 1, equations (5) and (6) are equivalent. However, we have replaced the iteration vector $\vec{i}$ with the vectors $\vec{z}_{k,j}$, each of which is an affine form in $\vec{n}$ and $\vec{v}_{A(T_j)}$.

4.4.3 Eliminating the Structural Parameters 

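Suppose $\mathcal{N}$ is also a bounded polyhedron. We eliminate the structural parameters the same way we eliminated the iteration vector: by only considering the extremal vertices of their domain $\mathcal{N}$. Thus, for any fixed value of $\Theta$ in $\mathcal{R}$, $j$ in $[1, m_p]$, and $k$ in $[1, m_z]$ we must have:

$$\forall \vec{n} \in \mathcal{N},\quad \Theta_{T_j}(h_j(\vec{z}_{k,j}(\vec{v}_{A(T_j)}, \vec{n}), \vec{n}) + \vec{v}_{A(T_j)}, \vec{n}) - \Theta_{R_j}(\vec{z}_{k,j}(\vec{v}_{A(T_j)}, \vec{n}), \vec{n}) - 1 \ge 0 \qquad (7)$$

Denoting the vertices of $\mathcal{N}$ by $(\vec{w}_1, \ldots, \vec{w}_{m_w})$, the above equation is equivalent to:

$$\forall l \in [1, m_w],\quad \Theta_{T_j}(h_j(\vec{z}_{k,j}(\vec{v}_{A(T_j)}, \vec{w}_l), \vec{w}_l) + \vec{v}_{A(T_j)}, \vec{w}_l) - \Theta_{R_j}(\vec{z}_{k,j}(\vec{v}_{A(T_j)}, \vec{w}_l), \vec{w}_l) - 1 \ge 0 \qquad (8)$$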



Case of unbounded domain of parameters. It might also be the case that $\mathcal{N}$ is not a polytope but an unbounded polyhedron, perhaps corresponding to a parameter that is input from the user and can be arbitrarily large. In this case, we use the general form of Theorem 1. Let $\vec{r}_1, \ldots, \vec{r}_{m_r}$ be the rays defining the unbounded portion of $\mathcal{N}$ (a line being coded by two opposite rays). We must ensure that the linear part of equation (8) is nonnegative on these rays. For example, given a single structural parameter $m \in [5, \infty)$, we have the following constraint for the vertex $m = 5$:

$$\Theta_{T_j}(h_j(\vec{z}_{k,j}(\vec{v}_{A(T_j)}, 5), 5) + \vec{v}_{A(T_j)}, 5) - \Theta_{R_j}(\vec{z}_{k,j}(\vec{v}_{A(T_j)}, 5), 5) - 1 \ge 0$$

and the following constraint for the positive ray of value 1:

$$\begin{aligned}
&\Theta_{T_j}(h_j(\vec{z}_{k,j}(\vec{v}_{A(T_j)}, 1), 1) + \vec{v}_{A(T_j)}, 1) - \Theta_{R_j}(\vec{z}_{k,j}(\vec{v}_{A(T_j)}, 1), 1) \\
&\quad - \Theta_{T_j}(h_j(\vec{z}_{k,j}(\vec{v}_{A(T_j)}, 0), 0) + \vec{v}_{A(T_j)}, 0) + \Theta_{R_j}(\vec{z}_{k,j}(\vec{v}_{A(T_j)}, 0), 0) \ge 0
\end{aligned} \qquad (9)$$

Though this equation may look complicated, in practice it leads to simple formulas since all the constant parts of equation (7) cancel. We assume in the rest of this paper that $\mathcal{N}$ is a polytope. This changes nothing in our method, but greatly improves the readability of the upcoming systems of constraints.

4.5 Finding a Solution 

After removing the structural parameters, we are left with the following set of storage constraints:

$$\forall j \in [1, m_p],\ \forall k \in [1, m_z],\ \forall l \in [1, m_w],\quad \Theta_{T_j}(h_j(\vec{z}_{k,j}(\vec{v}_{A(T_j)}, \vec{w}_l), \vec{w}_l) + \vec{v}_{A(T_j)}, \vec{w}_l) - \Theta_{R_j}(\vec{z}_{k,j}(\vec{v}_{A(T_j)}, \vec{w}_l), \vec{w}_l) - 1 \ge 0 \qquad (10)$$

which is a set of affine inequations in the coordinates of the schedule $\Theta$, with the occupancy vectors $\vec{v}_{A(T)}$ as unknowns. Note that the vertices $\vec{z}_{k,j}$ of the iteration domain, the vertices $\vec{w}_l$ of the structural parameters, and the components $h_j$ of the affine functions, all have fixed and known values.

Similarly, we can linearize the schedule constraints to arrive at the following equations:

$$\forall j \in [1, m_p],\ \forall k \in [1, m_y],\ \forall l \in [1, m_w],\quad \Theta_{R_j}(\vec{y}_{k,j}(\vec{w}_l), \vec{w}_l) - \Theta_{T_j}(h_j(\vec{y}_{k,j}(\vec{w}_l), \vec{w}_l), \vec{w}_l) - 1 \ge 0 \qquad (11)$$

where $\vec{y}_{1,j}, \ldots, \vec{y}_{m_y,j}$ denote the vertices of $\mathcal{P}_j$.

4.5.1 Finding an Occupancy Vector Given a Schedule 

At this point we have all we need to determine which occupancy vectors (if any) are valid for a given schedule $\Theta$: we simply substitute the value of the given schedule into the simplified storage constraints (10). We then obtain a set of affine inequalities where the only unknowns are the components of the occupancy vector. This system of constraints fully and exactly defines the set of the occupancy vectors valid for the given schedule. We can search this space for solutions with any linear programming solver.

To find the shortest occupancy vectors, we can use as our objective function the sum of the absolute values of the components of the occupancy vector. This metric minimizes the "Manhattan" length of each occupancy vector instead of the Euclidean length; minimizing the Euclidean length would require a non-linear objective function. Furthermore, on all of our examples, this linear objective function yields the occupancy vectors with the smallest Euclidean length.
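
Because an absolute value is not itself a linear expression, the Manhattan objective is usually encoded with one auxiliary variable per component. The formulation below is the standard textbook encoding, given here as an assumption about how the objective could be set up rather than as the formulation used in this paper; the variables $t_k$ are introduced only for this illustration.

```latex
\begin{aligned}
\text{minimize}\quad   & \sum_{k} t_k \\
\text{subject to}\quad & -t_k \le v_k \le t_k
                         \quad \text{for each component } v_k \text{ of } \vec{v}_A, \\
                       & \text{the storage constraints (10) with the given } \Theta \text{ substituted.}
\end{aligned}
```

At any optimum each $t_k$ equals $|v_k|$, so the objective value is the Manhattan length of the occupancy vector.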

4.5.2 Finding a Schedule Given an Occupancy Vector 

At this point, we also have all we need to determine which schedules (if any) exist for a given set of occupancy vectors. Given an occupancy vector $\vec{v}_A$ for each array $A$ in the program, we substitute into the linearized storage constraints (10) to obtain a set of inequalities where the only variables are the scheduling parameters. These inequalities, in combination with the linearized schedule constraints (11), completely define the space of affine schedules valid for the given occupancy vectors. Once again, we can search this space for solutions with any linear programming solver, selecting the "best" schedule as in [7].

4.5.3 Finding the AOV's 

Solving for the AOV's is more involved. To find a set of AOV's, we need to satisfy the storage constraints (10) for any value of the schedule $\Theta$ within the polyhedron $\mathcal{R}$ defined by the schedule constraints. To do this, we apply the Affine Form of Farkas' Lemma [17, 7, 4].

Theorem 2 (Affine Form of Farkas' Lemma) Let $\mathcal{D}$ be a nonempty polyhedron defined by $p$ affine inequalities

$$\vec{a}_j \cdot \vec{x} + b_j \ge 0, \quad j \in [1, p],$$

in a vector space $\mathcal{E}$. Then an affine form $\Psi$ is nonnegative everywhere in $\mathcal{D}$ if and only if it is a nonnegative affine combination of the affine forms defining $\mathcal{D}$:

$$\forall \vec{x} \in \mathcal{E},\quad \Psi(\vec{x}) = \lambda_0 + \sum_{j=1}^{p} \lambda_j (\vec{a}_j \cdot \vec{x} + b_j), \qquad \lambda_0, \ldots, \lambda_p \ge 0$$

The nonnegative constants $\lambda_j$ are referred to as Farkas multipliers.

To apply the lemma, we note that the storage constraints are affine forms that must be nonnegative over the polyhedron $\mathcal{R}$. Thus, we can express each storage constraint as a nonnegative affine combination of the schedule constraints defining $\mathcal{R}$.

To simplify our notation, let STORAGE be the set of expressions that are constrained to be nonnegative by the linearized storage constraints (10). That is, STORAGE contains the left hand side of each inequality in (10). Naively, $|STORAGE| = m_p \times m_z \times (m_w + m_r)$; however, several of these expressions might be equivalent, thereby reducing the size of STORAGE in practice.

Similarly, let SCHEDULE be the set of expressions that are constrained to be nonnegative by the linearized schedule constraints (11). The size of SCHEDULE is at most $m_p \times m_y \times (m_w + m_r)$.

Then, the application of Farkas' Lemma yields these identities across the vector space $\mathcal{E}$ in which $\Theta$ lives:

$$STORAGE_i = \lambda_{i,0} + \sum_{j=1}^{|SCHEDULE|} \lambda_{i,j} \cdot SCHEDULE_j, \qquad \lambda_{i,j} \ge 0,\ \forall \vec{x} \in \mathcal{E},\ \forall i \in [1, |STORAGE|]$$



    A[][] = new int[n][m]
    B[][] = new int[n][m]

    for i = 1 to n
      for j = 1 to m
        A[i][j] = f(B[i-1][j])        (S1)
        B[i][j] = g(A[i][j-1])        (S2)

Figure 7: Original code for Example 2.

Figure 8: Dependence diagram for Example 2.

Transformed code for Example 2 (see Figure 9):

    A[] = new int[m+n]
    B[] = new int[m+n]

    for i = 1 to n
      for j = 1 to m
        A[i-j+m] = f(B[(i-1)-j+m])    (S1)
        B[i-j+m] = g(A[i-(j-1)+m])    (S2)



These equations are valid over the whole vector space $\mathcal{E}$.
Therefore, we can collect the terms for each of the components
of $\vec{x}$, as well as the constant terms, setting equal the
respective coefficients of these terms from opposite sides of
a given equation (cf. [7, 4] for full details). We are left with
$|STORAGE| \times (3 \times m_s + 1)$ linear equations where the only
variables are the $\lambda$'s and the occupancy vectors $\vec{v}_A$.

The set of valid AOV's is completely and exactly determined
by this set of equations and inequalities. To find a
good AOV, we proceed as in Section 4.5.1.



5 Examples 

We present four examples to illustrate applications of the
method described above.



5.1 Example 1: Simple Stencil 

First we derive the solutions presented earlier for the 3-point 
stencil in Example 1. 



5.1.1 Constraints 

Let $\Theta$ denote the scheduling function for statement $S_1$ in
the example. We assume that $\Theta$ is an affine form as follows:

$$\Theta(i, j, n, m) = a + b \cdot i + c \cdot j + d \cdot n + e \cdot m$$

There are three dependencies in the stencil, each from $S_1$
to itself. The access functions describing the dependencies
are $h_1(i,j,n,m) = (i-2, j-1)$, $h_2(i,j,n,m) = (i, j-1)$, and
$h_3(i,j,n,m) = (i+1, j-1)$. Because these dependencies are
uniform (that is, they do not depend on the iteration vector),
we can simplify our analysis by considering the dependence
domains to be across all values of $i$ and $j$. Thus, the schedule
constraints are:

$$\begin{aligned}
\Theta(i, j, n, m) - \Theta(i-2, j-1, n, m) - 1 &\ge 0 \\
\Theta(i, j, n, m) - \Theta(i, j-1, n, m) - 1 &\ge 0 \\
\Theta(i, j, n, m) - \Theta(i+1, j-1, n, m) - 1 &\ge 0
\end{aligned}$$



Figure 9: Transformed code for Example 2. Each array has an 
AOV of (1,1). 



However, substituting the definition of $\Theta$ into these
equations, we find that $i$, $j$, $n$, and $m$ are eliminated. This is
because the constraints are uniform. Thus, we obtain the
following simplified schedule constraints, which are linear in
the scheduling parameters:

$$\begin{aligned}
2b + c - 1 &\ge 0 \\
c - 1 &\ge 0 \\
-b + c - 1 &\ge 0
\end{aligned}$$

Now let $\vec{v}_A = (v_i, v_j)$ denote the AOV that we are seeking
for array A. Then the storage constraints are as follows:

$$\begin{aligned}
\Theta(i - 2 + v_i, j - 1 + v_j, n, m) - \Theta(i, j, n, m) - 1 &\ge 0 \\
\Theta(i + v_i, j - 1 + v_j, n, m) - \Theta(i, j, n, m) - 1 &\ge 0 \\
\Theta(i + 1 + v_i, j - 1 + v_j, n, m) - \Theta(i, j, n, m) - 1 &\ge 0
\end{aligned}$$

Simplifying the storage constraints as we did the schedule
constraints, we obtain the linearized storage constraints:

$$\begin{aligned}
b \, v_i + c \, v_j - 2b - c - 1 &\ge 0 \\
b \, v_i + c \, v_j - c - 1 &\ge 0 \\
b \, v_i + c \, v_j + b - c - 1 &\ge 0
\end{aligned}$$
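The elimination of the iteration variables can be reproduced mechanically. The following Python sketch is illustrative only: the helper names `schedule_row` and `storage_row` are hypothetical, and each linearized constraint is represented simply by its coefficient on b, its coefficient on c, and its constant term, for a uniform dependence whose access function is (i + di, j + dj).

```python
# Offsets (di, dj) of the three access functions h1, h2, h3 of the stencil.
DEPS = [(-2, -1), (0, -1), (1, -1)]

def schedule_row(di, dj):
    """(coeff of b, coeff of c, constant) of Theta(i,j) - Theta(i+di, j+dj) - 1."""
    return (-di, -dj, -1)

def storage_row(di, dj, vi, vj):
    """(coeff of b, coeff of c, constant) of
    Theta(i+di+vi, j+dj+vj) - Theta(i,j) - 1 for a fixed occupancy vector."""
    return (di + vi, dj + vj, -1)

schedule_rows = [schedule_row(di, dj) for di, dj in DEPS]
# Storage rows for one concrete occupancy vector, here (vi, vj) = (0, 2).
storage_rows_02 = [storage_row(di, dj, 0, 2) for di, dj in DEPS]

for row in schedule_rows + storage_rows_02:
    print(row)
```

Each printed triple (p, q, r) stands for the inequality p*b + q*c + r >= 0, so the first three rows correspond to the simplified schedule constraints.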

5.1.2 Finding an Occupancy Vector 

To find the shortest occupancy vector for the schedule that
executes the rows in parallel, we substitute $\Theta(i, j, n, m) = j$
(that is, $b = 0$ and $c = 1$) into the linearized schedule and
storage constraints. Minimizing $v_i + v_j$ with respect to these
constraints gives the occupancy vector of (0,2) (see Figure 2).
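For a fixed schedule the search can also be done by brute force on a small grid. The sketch below assumes the schedule coefficients b = 0 and c = 1, and it measures "shortest" by squared Euclidean length, which is an assumption made here so that negative components are handled sensibly; the function names and the search bound are hypothetical.

```python
from itertools import product

DEPS = [(-2, -1), (0, -1), (1, -1)]   # offsets of h1, h2, h3

def storage_ok(b, c, vi, vj):
    """True if every linearized storage constraint holds for schedule b*i + c*j."""
    return all(b * (di + vi) + c * (dj + vj) - 1 >= 0 for di, dj in DEPS)

def shortest_ov(b, c, bound=4):
    """Brute-force the valid occupancy vector of least squared length."""
    candidates = [(vi, vj)
                  for vi, vj in product(range(-bound, bound + 1), repeat=2)
                  if storage_ok(b, c, vi, vj)]
    return min(candidates, key=lambda v: v[0] ** 2 + v[1] ** 2)

best = shortest_ov(b=0, c=1)
print(best)
```

A Linear Programming solver would replace the enumeration in practice; the grid search is only meant to make the constraint set tangible.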

5.1.3 Finding a Schedule 

To find the set of schedules that are valid for the occupancy
vector of (0,2), we substitute $v_i = 0$ and $v_j = 2$ into the
linearized schedule and storage constraints. Simplifying the
resulting constraints yields:

$$\begin{aligned}
c &\ge 1 + 2b \\
c &\ge 1 - 2b
\end{aligned}$$

which corresponds to the set of legal affine schedules as
depicted in Figure 5.
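The simplification can be sanity-checked numerically. This sketch (with hypothetical helper names and an arbitrarily chosen integer grid) compares the full set of six linearized constraints for the occupancy vector (0, 2) against the two simplified inequalities.

```python
DEPS = [(-2, -1), (0, -1), (1, -1)]   # offsets of h1, h2, h3
VI, VJ = 0, 2                          # the fixed occupancy vector

def all_constraints_hold(b, c):
    """Evaluate the three schedule and three storage constraints."""
    schedule = all(-b * di - c * dj - 1 >= 0 for di, dj in DEPS)
    storage = all(b * (di + VI) + c * (dj + VJ) - 1 >= 0 for di, dj in DEPS)
    return schedule and storage

def simplified(b, c):
    """The two inequalities obtained after simplification."""
    return c >= 1 + 2 * b and c >= 1 - 2 * b

GRID = [(b, c) for b in range(-10, 11) for c in range(-10, 11)]
mismatches = [(b, c) for b, c in GRID
              if all_constraints_hold(b, c) != simplified(b, c)]
print(mismatches)
```

An empty list indicates that, on this grid at least, the two descriptions of the schedule space agree.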

5.1.4 Finding an AOV 

To find an AOV for A, we apply Farkas' Lemma to rewrite 
each of the linearized storage constraints as a non-negative 
affine combination of the linearized schedule constraints: 







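$$\begin{bmatrix}
b \, v_i + c \, v_j - 2b - c - 1 \\
b \, v_i + c \, v_j - c - 1 \\
b \, v_i + c \, v_j + b - c - 1
\end{bmatrix}
=
\begin{bmatrix}
\lambda_{1,1} & \lambda_{1,2} & \lambda_{1,3} & \lambda_{1,4} \\
\lambda_{2,1} & \lambda_{2,2} & \lambda_{2,3} & \lambda_{2,4} \\
\lambda_{3,1} & \lambda_{3,2} & \lambda_{3,3} & \lambda_{3,4}
\end{bmatrix}
\begin{bmatrix}
1 \\
2b + c - 1 \\
c - 1 \\
-b + c - 1
\end{bmatrix}$$

$$\lambda_{i,j} \ge 0, \quad \forall i \in [1,3], \ \forall j \in [1,4]$$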



imax = a.length
jmax = b.length
kmax = c.length
D[][][] = new int[imax][jmax][kmax]

for i = 1 to imax
  for j = 1 to jmax
    for k = 1 to kmax
      if (i==1) or (j==1) or (k==1) then
        D[i][j][k] = f(i,j,k)                              (S1)
      else
        D[i][j][k] =                                       (S2)
          min(D[i-1][j-1][k-1] + w(a[i],b[j],c[k]),
              D[i][j-1][k-1] + w(GAP,b[j],c[k]),
              D[i-1][j][k-1] + w(a[i],GAP,c[k]),
              D[i-1][j-1][k] + w(a[i],b[j],GAP),
              D[i-1][j][k] + w(a[i],GAP,GAP),
              D[i][j-1][k] + w(GAP,b[j],GAP),
              D[i][j][k-1] + w(GAP,GAP,c[k]))

Figure 10: Original code for Example 3, for multiple sequence
alignment. Here f computes the initial gap penalty and w
computes the pairwise alignment cost.

imax = a.length
jmax = b.length
kmax = c.length
D[][] = new int[imax+jmax][imax+kmax]

for i = 1 to imax
  for j = 1 to jmax
    for k = 1 to kmax
      if (i==1) or (j==1) or (k==1) then
        D[jmax+i-j][kmax+i-k] = f(i,j,k)                   (S1)
      else
        D[jmax+i-j][kmax+i-k] =                            (S2)
          min(D[jmax+(i-1)-(j-1)][kmax+(i-1)-(k-1)] + w(a[i],b[j],c[k]),
              D[jmax+i-(j-1)][kmax+i-(k-1)] + w(GAP,b[j],c[k]),
              D[jmax+(i-1)-j][kmax+(i-1)-(k-1)] + w(a[i],GAP,c[k]),
              D[jmax+(i-1)-(j-1)][kmax+(i-1)-k] + w(a[i],b[j],GAP),
              D[jmax+(i-1)-j][kmax+(i-1)-k] + w(a[i],GAP,GAP),
              D[jmax+i-(j-1)][kmax+i-k] + w(GAP,b[j],GAP),
              D[jmax+i-j][kmax+i-(k-1)] + w(GAP,GAP,c[k]))

Figure 11: Transformed code for Example 3, using the AOV of
(1,1,1). The new array has dimension [imax+jmax][imax+kmax],
with each reference to [i][j][k] mapped to [jmax+i-j][kmax+i-k].



Minimizing $v_i + v_j$ subject to these constraints yields an
AOV $(v_i, v_j) = (1, 2)$, which is smaller than the shortest
UOV of (0,3) [18].
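The Farkas identity for the AOV (1, 2) can be checked by exhibiting explicit multipliers. The multipliers in the sketch below were worked out by hand for illustration and are not taken from the paper; affine forms are stored as (coefficient of b, coefficient of c, constant), and the helper names are hypothetical.

```python
from fractions import Fraction as F

ONE = (0, 0, 1)                                   # the constant affine form 1
SCHEDULE = [(2, 1, -1), (0, 1, -1), (-1, 1, -1)]  # linearized schedule constraints
BASIS = [ONE] + SCHEDULE

def storage_rows(vi, vj):
    """Linearized storage constraints for the occupancy vector (vi, vj)."""
    return [(vi - 2, vj - 1, -1), (vi, vj - 1, -1), (vi + 1, vj - 1, -1)]

def combine(multipliers):
    """Form the combination sum_k multipliers[k] * BASIS[k]."""
    return tuple(sum(lam * row[k] for lam, row in zip(multipliers, BASIS))
                 for k in range(3))

# Hand-derived nonnegative multipliers, one row per storage constraint.
LAMBDA = [
    [0, 0, 0, 1],               # storage 1 = schedule 3
    [0, F(1, 2), F(1, 2), 0],   # storage 2 = (schedule 1 + schedule 2) / 2
    [0, 1, 0, 0],               # storage 3 = schedule 1
]

checks = [combine(lam) == row for lam, row in zip(LAMBDA, storage_rows(1, 2))]

def min_storage_value(vi, vj, b, c):
    """Smallest storage-constraint value at the schedule b*i + c*j."""
    return min(rb * b + rc * c + k for rb, rc, k in storage_rows(vi, vj))

print(checks, min_storage_value(0, 2, 1, 2))
```

The last call illustrates why (0, 2) is not an AOV: the schedule with b = 1 and c = 2 satisfies the schedule constraints, yet one storage constraint evaluates to a negative number for that occupancy vector.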

To transform the data space of array A according to this
AOV $\vec{v}$, we follow the approach of [18] and project the
original data space onto the line perpendicular to $\vec{v}$. Choosing
$\vec{v}_\perp = (2, -1)$ so that $\vec{v} \cdot \vec{v}_\perp = 0$, we transform the original
indices of $(i, j)$ into $\vec{v}_\perp \cdot (i, j) = 2i - j$. Finally, to ensure
that all data accesses are non-negative, we add m to the new
index, such that the final transformation is from A[i][j] to
A[2*i - j + m]. Thus, we have reduced storage requirements
from n*m to 2*n + m. The modified code corresponding
to this mapping is shown in Figure 6.
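One way to gain confidence in the mapping is to execute the stencil under several valid affine schedules with both storage layouts and compare every computed value. The sketch below is a simulation under stated assumptions: the stencil body `f`, the boundary value returned for reads outside the iteration domain, and the random tie-breaking among iterations scheduled at the same time are all hypothetical choices made for the test.

```python
import random

def run_stencil(n, m, b, c, transformed, seed=0):
    """Run A[i][j] = f(A[i-2][j-1], A[i][j-1], A[i+1][j-1]) under the
    affine schedule Theta = b*i + c*j, breaking ties in random order."""
    f = lambda x, y, z: (3 * x + 5 * y + 7 * z + 1) % 1009  # hypothetical f
    if transformed:
        store = [None] * (2 * n + m)          # 1-D array of size 2n + m
        addr = lambda i, j: 2 * i - j + m      # A[i][j] -> A[2i - j + m]
    else:
        store = {}                             # stands in for the n x m array
        addr = lambda i, j: (i, j)

    def read(i, j):
        # Reads outside the iteration domain return a hypothetical boundary value.
        return store[addr(i, j)] if 1 <= i <= n and 1 <= j <= m else 1

    rng = random.Random(seed)
    order = sorted((b * i + c * j, rng.random(), i, j)
                   for i in range(1, n + 1) for j in range(1, m + 1))
    results = {}
    for _, _, i, j in order:
        value = f(read(i - 2, j - 1), read(i, j - 1), read(i + 1, j - 1))
        store[addr(i, j)] = value
        results[(i, j)] = value                # record every computed value
    return results

reference = run_stencil(6, 7, 0, 1, transformed=False)
# Each (b, c) below satisfies the linearized schedule constraints.
agree = all(run_stencil(6, 7, b, c, transformed=True, seed=s) == reference
            for s, (b, c) in enumerate([(0, 1), (1, 2), (-1, 3), (2, 3)]))
print(agree)
```

Agreement on a handful of schedules is evidence rather than proof; the guarantee for all affine schedules comes from the Farkas argument.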

5.2 Example 2: Two-Statement Stencil 

We now consider an example adapted from [12] where there
is a uniform dependence between statements in a loop (see
Figures 7 and 8). Letting $\Theta_1$ and $\Theta_2$ denote the
scheduling functions for statements 1 and 2, respectively, we have
the following schedule constraints:

$$\begin{aligned}
\Theta_1(i, j, n, m) - \Theta_2(i-1, j, n, m) - 1 &\ge 0 \\
\Theta_2(i, j, n, m) - \Theta_1(i, j-1, n, m) - 1 &\ge 0
\end{aligned}$$

and the following storage constraints:

$$\begin{aligned}
\Theta_2(i - 1 + v_{B,i}, j + v_{B,j}, n, m) - \Theta_1(i, j, n, m) - 1 &\ge 0 \\
\Theta_1(i + v_{A,i}, j - 1 + v_{A,j}, n, m) - \Theta_2(i, j, n, m) - 1 &\ge 0
\end{aligned}$$



We now demonstrate how to linearize the schedule
constraints. We observe that the polyhedral domain of the
iteration parameters $(i, j)$ has vertices at $(1,1)$, $(n,1)$, $(1,m)$, $(n,m)$,
so we evaluate the schedule constraints at these points to
eliminate $(i, j)$:

$$\begin{aligned}
\Theta_1(1, 1, n, m) - \Theta_2(0, 1, n, m) - 1 &\ge 0 \\
\Theta_2(1, 1, n, m) - \Theta_1(1, 0, n, m) - 1 &\ge 0 \\
\Theta_1(1, m, n, m) - \Theta_2(0, m, n, m) - 1 &\ge 0 \\
\Theta_2(1, m, n, m) - \Theta_1(1, m-1, n, m) - 1 &\ge 0 \\
\Theta_1(n, 1, n, m) - \Theta_2(n-1, 1, n, m) - 1 &\ge 0 \\
\Theta_2(n, 1, n, m) - \Theta_1(n, 0, n, m) - 1 &\ge 0 \\
\Theta_1(n, m, n, m) - \Theta_2(n-1, m, n, m) - 1 &\ge 0 \\
\Theta_2(n, m, n, m) - \Theta_1(n, m-1, n, m) - 1 &\ge 0
\end{aligned}$$
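Evaluating at the vertices is sufficient because an affine form attains its minimum over a bounded polyhedron at a vertex. The following sketch illustrates this for a rectangular domain; the function names are hypothetical, and the exhaustive enumeration is included only as a reference against which the vertex test is compared.

```python
from itertools import product

def affine(coeffs, const, point):
    """Value of const + coeffs . point."""
    return const + sum(a * x for a, x in zip(coeffs, point))

def nonneg_at_vertices(coeffs, const, lo, hi):
    """Check an affine form only at the corners of the box lo <= x <= hi."""
    corners = product(*zip(lo, hi))
    return all(affine(coeffs, const, p) >= 0 for p in corners)

def nonneg_everywhere(coeffs, const, lo, hi):
    """Reference check: enumerate every integer point of the box."""
    ranges = [range(l, h + 1) for l, h in zip(lo, hi)]
    return all(affine(coeffs, const, p) >= 0 for p in product(*ranges))

# Example: the form 2*i - j + 3 over 1 <= i <= 5, 1 <= j <= 4.
print(nonneg_at_vertices((2, -1), 3, (1, 1), (5, 4)),
      nonneg_everywhere((2, -1), 3, (1, 1), (5, 4)))
```

The vertex test inspects four points regardless of the size of the domain, which is what makes the linearization independent of n and m.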



Next, we eliminate the structural parameters $(n, m)$.
Assuming $n$ and $m$ are positive but arbitrarily large, the
domain of these parameters is an unbounded polyhedron:
$(n, m) = (1, 1) + j \cdot (0, 1) + k \cdot (1, 0)$, for non-negative integers $j$ and $k$. We
must evaluate the above constraints at the vertex $(1,1)$, as
well as the linear part of the constraints for the rays $(1, 0)$
and $(0, 1)$. Doing so yields 24 equations, of which we show
the first 3 (which result from substituting into the first of
the equations above):

$$\begin{aligned}
\Theta_1(1, 1, 1, 1) - \Theta_2(0, 1, 1, 1) - 1 &\ge 0 \\
\Theta_1(1, 1, 1, 0) - \Theta_2(0, 1, 1, 0) - \Theta_1(1, 1, 0, 0) + \Theta_2(0, 1, 0, 0) &\ge 0 \\
\Theta_1(1, 1, 0, 1) - \Theta_2(0, 1, 0, 1) - \Theta_1(1, 1, 0, 0) + \Theta_2(0, 1, 0, 0) &\ge 0
\end{aligned}$$

Expanding the scheduling functions as $\Theta_x(i, j, n, m) = a_x +
b_x \cdot i + c_x \cdot j + d_x \cdot n + e_x \cdot m$, the entire set of 24 equations
can be simplified to:

$$\begin{aligned}
c_1 &= c_2 \\
e_1 &= e_2 \\
a_1 + b_2 + c_1 + e_1 - a_2 - c_2 - e_2 + (b_1 + d_1 - b_2 - d_2) \, n - 1 &\ge 0 \\
a_1 + 2 b_1 + c_1 + e_1 - a_2 - b_2 - c_2 - e_2 + (d_1 - d_2) \, n - 1 &\ge 0 \\
a_2 + b_2 + 2 c_2 + e_2 - a_1 - b_1 - c_1 - e_1 + (d_2 - d_1) \, n - 1 &\ge 0 \\
a_2 + 2 c_2 + e_2 - a_1 - c_1 - e_1 + (b_2 + d_2 - b_1 - d_1) \, n - 1 &\ge 0
\end{aligned}$$

These equations constitute the linearized schedule constraints.
In a similar fashion, we could linearize the storage
constraints, and then apply Farkas' Lemma to find the
shortest AOV's of $\vec{v}_A = \vec{v}_B = (1, 1)$. Due to space limitations, we
do not derive the entire solution here. The code that results
after transformation by these AOV's is shown in Figure 9.
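As an informal check of these AOV's, the two-statement loop can be simulated with the original and the collapsed arrays under one particular affine schedule. The sketch below assumes the schedules Theta_1 = 2(i+j) and Theta_2 = 2(i+j)+1, which satisfy the schedule constraints above; the bodies of `f` and `g`, the boundary value, and the random tie-breaking are hypothetical.

```python
import random

def run_example2(n, m, transformed, seed=0):
    """Execute Example 2 under Theta_1 = 2(i+j), Theta_2 = 2(i+j)+1."""
    f = lambda x: (3 * x + 1) % 1013      # hypothetical stand-in for f
    g = lambda x: (5 * x + 2) % 1013      # hypothetical stand-in for g
    if transformed:
        A, B = [0] * (m + n), [0] * (m + n)
        addr = lambda i, j: i - j + m      # collapse along the AOV (1, 1)
    else:
        A, B = {}, {}                      # stand in for the n x m arrays
        addr = lambda i, j: (i, j)

    def read(array, i, j):
        # Out-of-domain reads return a hypothetical boundary value.
        return array[addr(i, j)] if 1 <= i <= n and 1 <= j <= m else 7

    rng = random.Random(seed)
    ops = sorted((2 * (i + j) + s, rng.random(), s, i, j)
                 for i in range(1, n + 1)
                 for j in range(1, m + 1)
                 for s in (0, 1))
    trace = {}
    for _, _, s, i, j in ops:
        if s == 0:                         # statement S1
            value = A[addr(i, j)] = f(read(B, i - 1, j))
        else:                              # statement S2
            value = B[addr(i, j)] = g(read(A, i, j - 1))
        trace[(s, i, j)] = value           # record every computed value
    return trace

same = run_example2(5, 6, True, seed=1) == run_example2(5, 6, False, seed=2)
print(same)
```

The trace records the value produced by every statement instance, so equality of the two traces means the collapsed arrays never lose a value that is still needed under this schedule.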

5.3 Example 3: Multiple Sequence Alignment 

We now consider a version of the Needleman-Wunsch
sequence alignment algorithm [15] to determine the cost of
the optimal global alignment of three strings (see Figure 10).
The algorithm utilizes dynamic programming to determine
the minimum-cost alignment according to a cost function w
that specifies the cost of aligning three characters, some of
which might represent gaps in the alignment.

A[][] = new int[n][m]
B[] = new int[n]

for i = 1 to n
  for j = 1 to n
    A[i][j] = B[i-1]+j    (S1)
  B[i] = A[i][n-i]        (S2)

Figure 12: Original code for Example 4.

Figure 13: Dependence diagram for Example 4.

Transformed code for Example 4 (see Figure 14):

A[] = new int[n]
B = new int

for i = 1 to n
  for j = 1 to n
    A[i] = B+j            (S1)
  B = A[i]                (S2)

Using $\Theta_1$ and $\Theta_2$ to represent the scheduling functions
for statements 1 and 2, respectively, we have the following
schedule constraints (we enumerate only three constraints
for each pair of statements since the other dependencies
follow by transitivity):

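$$\begin{aligned}
\Theta_2(i, j, k, x, y, z) - \Theta_1(i-1, j, k, x, y, z) - 1 &\ge 0 && \text{for } i = 2,\ j \in [3, y],\ k \in [3, z] \\
\Theta_2(i, j, k, x, y, z) - \Theta_1(i, j-1, k, x, y, z) - 1 &\ge 0 && \text{for } i \in [3, x],\ j = 2,\ k \in [3, z] \\
\Theta_2(i, j, k, x, y, z) - \Theta_1(i, j, k-1, x, y, z) - 1 &\ge 0 && \text{for } i \in [3, x],\ j \in [3, y],\ k = 2 \\
\Theta_2(i, j, k, x, y, z) - \Theta_2(i-1, j, k, x, y, z) - 1 &\ge 0 && \text{for } i \in [3, x],\ j \in [2, y],\ k \in [2, z] \\
\Theta_2(i, j, k, x, y, z) - \Theta_2(i, j-1, k, x, y, z) - 1 &\ge 0 && \text{for } i \in [2, x],\ j \in [3, y],\ k \in [2, z] \\
\Theta_2(i, j, k, x, y, z) - \Theta_2(i, j, k-1, x, y, z) - 1 &\ge 0 && \text{for } i \in [2, x],\ j \in [2, y],\ k \in [3, z]
\end{aligned}$$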

Note that each constraint is restricted to the subset of the
iteration domain under which it applies. That is, $S_2$ depends
on $S_1$ only when $i$, $j$, or $k$ is equal to 2; otherwise, $S_2$
depends on itself. This example illustrates the precision of
our technique for general dependence domains.

The storage constraints are as follows:

$$\begin{aligned}
\Theta_2(i - 1 + v_i, j + v_j, k + v_k, x, y, z) - \Theta_2(i, j, k, x, y, z) - 1 &\ge 0 && \text{for } i \in [3, x],\ j \in [2, y],\ k \in [2, z] \\
\Theta_2(i + v_i, j - 1 + v_j, k + v_k, x, y, z) - \Theta_2(i, j, k, x, y, z) - 1 &\ge 0 && \text{for } i \in [2, x],\ j \in [3, y],\ k \in [2, z] \\
\Theta_2(i + v_i, j + v_j, k - 1 + v_k, x, y, z) - \Theta_2(i, j, k, x, y, z) - 1 &\ge 0 && \text{for } i \in [2, x],\ j \in [2, y],\ k \in [3, z]
\end{aligned}$$

There is no storage constraint corresponding to the
dependence of $S_2$ on $S_1$ because the domain $\mathcal{Z}$ of the constraint
is empty for occupancy vectors with positive components,
and occupancy vectors with a non-positive component do
not satisfy the above constraints. That is, for the first
dependence of $S_2$ on $S_1$, the dependence domain is
$\mathcal{D} = \{(2, j, k) \mid j \in [3, y] \wedge k \in [3, z]\}$ while the existence domain
of $S_1$ is $\mathcal{D}_{S_1} = \{(i, j, k) \mid i \in [1, x] \wedge j \in [1, y] \wedge k \in [1, z] \wedge (i = 1 \vee j = 1 \vee k = 1)\}$.
Then, the domain of the first storage constraint is
$\mathcal{Z} = \{(i, j, k) \mid (i, j, k) \in \mathcal{D} \wedge (i - 1, j, k) + \vec{v}_A \in \mathcal{D}_{S_1}\}$.



Figure 14: Transformed code for Example 4. The AOV's for A 
and B are (1,0) and 1, respectively. 



Now, $\mathcal{Z}$ is empty given that $\vec{v}_A$ has positive components,
because if $(i, j, k) \in \mathcal{D}$ then $i = 2$, but if $(i - 1, j, k) + \vec{v}_A \in \mathcal{D}_{S_1}$
then $i - 1 + v_{A,i} = 1$, or equivalently $i + v_{A,i} = 2$. Thus for
$\mathcal{Z}$ to be non-empty, we would have $2 + v_{A,i} = 2$, which
contradicts the positivity assumption on $v_{A,i}$. The argument is
analogous for other dependencies of $S_2$ on $S_1$.

Applying our method for this example yields an AOV of 
(1, 1, 1). The transformed code under this occupancy vector 
is just like the original, except that the array is of dimension 
[imax+jmax][imax+kmax] and element [i][j][k] is mapped to 
[jmax+i-j] [kmax+i-k]. 
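The mapping can be exercised by running the alignment recurrence with both array layouts and comparing the final cost. The sketch below uses the original sequential loop order rather than a parallel schedule, and the cost function `w`, the gap penalty `f`, and the gap symbol are hypothetical placeholders chosen only so that the code runs.

```python
from itertools import product

GAP = "-"

def w(x, y, z):
    """Hypothetical alignment cost: number of mismatching pairs."""
    return (x != y) + (x != z) + (y != z)

def f(i, j, k):
    """Hypothetical initial gap penalty."""
    return i + j + k

def align(a, b, c, transformed):
    imax, jmax, kmax = len(a), len(b), len(c)
    if transformed:
        # 2-D array of dimension [imax+jmax][imax+kmax], as in Figure 11.
        D = [[None] * (imax + kmax) for _ in range(imax + jmax)]
        def get(i, j, k): return D[jmax + i - j][kmax + i - k]
        def put(i, j, k, v): D[jmax + i - j][kmax + i - k] = v
    else:
        D = {}                                   # stands in for the 3-D array
        def get(i, j, k): return D[i, j, k]
        def put(i, j, k, v): D[i, j, k] = v
    for i, j, k in product(range(1, imax + 1),
                           range(1, jmax + 1),
                           range(1, kmax + 1)):
        if i == 1 or j == 1 or k == 1:
            put(i, j, k, f(i, j, k))                       # S1
        else:
            x, y, z = a[i - 1], b[j - 1], c[k - 1]         # strings are 0-based
            put(i, j, k, min(                              # S2
                get(i - 1, j - 1, k - 1) + w(x, y, z),
                get(i, j - 1, k - 1) + w(GAP, y, z),
                get(i - 1, j, k - 1) + w(x, GAP, z),
                get(i - 1, j - 1, k) + w(x, y, GAP),
                get(i - 1, j, k) + w(x, GAP, GAP),
                get(i, j - 1, k) + w(GAP, y, GAP),
                get(i, j, k - 1) + w(GAP, GAP, z)))
    return get(imax, jmax, kmax)

print(align("ACGT", "AGT", "ACT", True) == align("ACGT", "AGT", "ACT", False))
```

All seven reads of S2 are evaluated before the write takes place, which matters for the diagonal neighbor: its cell is the very cell being overwritten.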

5.4 Example 4: Non-Uniform Dependencies 

Our final example is constructed to demonstrate the
application of our method to non-uniform dependencies (see
Figures 12 and 13). Let $\Theta_1$ and $\Theta_2$ denote the scheduling
functions for statements $S_1$ and $S_2$, respectively. Then we
have the following schedule constraints:

$$\begin{aligned}
\Theta_1(i, j, n) - \Theta_2(i-1, n) - 1 &\ge 0 \\
\Theta_2(i, n) - \Theta_1(i, n-i, n) - 1 &\ge 0
\end{aligned}$$

and the following storage constraints:

$$\begin{aligned}
\Theta_2(i - 1 + v_B, n) - \Theta_1(i, j, n) - 1 &\ge 0 \\
\Theta_1(i + v_{A,i}, n - i + v_{A,j}, n) - \Theta_2(i, n) - 1 &\ge 0
\end{aligned}$$

Applying our method to these constraints yields the AOV's
$\vec{v}_A = (1, 0)$ and $v_B = 1$. The transformed code is shown in
Figure 14.
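The four constraints can be evaluated directly for a concrete schedule. In the sketch below the schedules Theta_1 = 2i and Theta_2 = 2i + 1 are a hypothetical choice, and the loop bounds used as dependence domains (i from 2 for the read of B[i-1], and i up to n-1 for the read of A[i][n-i]) are assumptions made for illustration rather than domains stated in the text.

```python
def theta1(i, j, n):
    """Hypothetical schedule for S1."""
    return 2 * i

def theta2(i, n):
    """Hypothetical schedule for S2."""
    return 2 * i + 1

def constraints_hold(n, vA=(1, 0), vB=1):
    """Evaluate the schedule and storage constraints of Example 4."""
    ok = True
    for i in range(2, n + 1):            # assumed domain: B[i-1] was written
        for j in range(1, n + 1):
            ok &= theta1(i, j, n) - theta2(i - 1, n) - 1 >= 0
            ok &= theta2(i - 1 + vB, n) - theta1(i, j, n) - 1 >= 0
    for i in range(1, n):                # assumed domain: 1 <= n - i
        ok &= theta2(i, n) - theta1(i, n - i, n) - 1 >= 0
        ok &= theta1(i + vA[0], n - i + vA[1], n) - theta2(i, n) - 1 >= 0
    return ok

print(constraints_hold(6), constraints_hold(6, vB=0))
```

With the occupancy vectors from the text every constraint is met for this schedule, whereas shrinking the occupancy vector of B to 0 violates the first storage constraint.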

6 Results 

We performed preliminary experiments that validate our 
technique as applied to two of our examples. The tests were 
carried out on an SGI Origin 2000, which uses MIPS R10000 
processors with 4MB L2 caches. 

For Example 2, the computation was divided into di- 
agonal strips. Since there are no data dependencies be- 
tween strips, the strips can be assigned to processors with- 
out requiring any synchronization [12]. Figure 15 shows the 
speedup gained on varying numbers of processors using both 
the original and the transformed array. Both versions show 
the same trend and do not significantly improve past 16 
processors, but the transformed code has an advantage by a 
sizable constant factor. 

Example 3 was parallelized by blocking the computation, 
and assigning rows of blocks to each processor. As shown 
in Figure 16, the transformed code again performs substan- 
tially better than the original code. With the reduced work- 
ing set of data in the transformed code, the speedup is super- 
linear in the number of processors due to improved caching. 



Figure 15: Speedup vs. number of processors for Example 2.

Figure 16: Speedup vs. number of processors for Example 3.



7 Related Work 

The work most closely related to ours is that of [18], which 
considers schedule-independent storage mappings using the 
Universal Occupancy Vector (UOV). While an AOV is valid 
only for affine schedules, a UOV is valid for any legal execu- 
tion ordering. There are a number of advantages of AOV's 
over UOV's. Testing whether a given occupancy vector is a
UOV is an NP-complete problem, and [18] gives a branch-and-bound
algorithm to find the smallest UOV. In contrast,
an AOV can be quickly checked by determining the feasibil- 
ity of a set of linear constraints, and we employ numeri- 
cal programming techniques to efficiently find short AOV's. 
Moreover, the analysis of [18] is limited to a stencil of de- 
pendencies involving only one statement within a perfectly 
nested loop, whereas our method applies to general affine 
dependencies across statements and loop nests. Also, some- 
times there are AOV's shorter than any UOV since the AOV 
must be valid for a smaller range of schedules. Finally, our 
framework goes beyond AOV's to unify the notion of occu- 
pancy vectors with known affine scheduling techniques. 

Another related approach to storage management for 
parallel programs is that of [3, 2, 11]. Given an affine sched- 
ule, [11] optimizes storage first by restricting the size of each 
array dimension and then by combining distinct arrays via 
renaming. This work is extended in [3, 2] to consider storage 
mappings for a set of schedules, towards the end of capturing 
the tradeoff between parallelism and storage. 

However, these techniques utilize a storage mapping where,
in an assignment, each array dimension is indexed by a
loop counter and is modulated independently (e.g.
A[i mod n][j mod m]). This is distinct from the occupancy
vector mapping, where the data space of the array is
projected onto a hyperplane before modulation (if any) is
introduced. The former mapping, when applied to all valid
affine schedules, does not enable any storage reuse in
Examples 2 and 3, where the AOV did. However, with occupancy
vectors we can only reduce the dimensionality of an array 
by one, whereas the other mapping can introduce constant 
bounds in several dimensions. We hope to extend the occu- 
pancy vector method in this capacity in the future. 

Memory reuse in the context of the polyhedral model 
is also considered in [19]. This approach uses yet another 
storage mapping, which utilizes array transformations on 
the data space to achieve the effect of multiple occupancy 
vectors applied at once. However, the mapping does not
have any modulation, so it could not duplicate the effect of
the (0, 2) occupancy vector we found (for a given schedule) in
Example 1. Unifying our framework with the data mapping
of [19] could be a fruitful direction for future research.

8 Conclusion 

We have presented a mathematical framework that unifies 
the techniques of affine scheduling and occupancy vector 
analysis. Within this framework, we showed how to deter- 
mine the best storage mapping for a given schedule, the best 
schedule for a given storage mapping, and the best storage 
mapping that is valid for all legal schedules. Our technique is 
general and precise, allowing inter-statement affine depen- 
dencies and efficiently solving for the minimal occupancy 
vector using standard numerical programming methods. 

We consider this research to be the first step towards 
automating a procedure that finds the optimal tradeoff be- 
tween parallelism and storage space. This question is very 
relevant in the context of array expansion, where the cost of 
extra array dimensions must be weighed against the schedul- 
ing freedom that they provide. Additionally, our framework 
could be applied to single-assignment functional languages 
where all storage reuse must be orchestrated by the com- 
piler. In both of these applications, and even for compil- 
ing to uniprocessor systems, understanding the interplay be- 
tween scheduling and storage is crucial for achieving good 
performance. 

In future work, we hope to consider more general storage 
mappings. The occupancy vector method as it stands now 
can only decrease the dimensionality of an array by one, 
and the irregular shape of the resulting data space could be 
hard to embed in a rectilinear array in a storage-efficient 
way. However, other storage mappings [11, 19] we discussed 
also have their limitations. The perfect storage mapping 
would allow variations in the number of array dimensions, 
while still capturing the directional and modular reuse of the 
occupancy vector and having an efficient implementation; 
it should also lend itself to efficient storage reuse between 
distinct arrays. 

Additionally, we plan to extend our method to multi- 
dimensional schedules, and to consider integrating our method 
with affine partitioning techniques [13]. Finally, it could be 
useful to incorporate cache behavior into our model, as false 
sharing effects may be amplified in parallel systems as the 
size of the store decreases. 